Hierarchically Clustered Representation Learning
The joint optimization of representation learning and clustering in the
embedding space has seen a breakthrough in recent years. Despite this advance,
clustering with representation learning has been limited to flat-level
categories, which often amounts to cohesive clustering focused on instance
relations. To overcome the limitations of flat clustering, we introduce
hierarchically-clustered representation learning (HCRL), which simultaneously
optimizes representation learning and hierarchical clustering in the embedding
space. Compared with the few prior works, HCRL is the first to model the
generation of deep embeddings from every component of the hierarchy, not just
the leaf components. In addition to obtaining hierarchically clustered
embeddings, we can reconstruct data at various abstraction levels, infer the
intrinsic hierarchical structure, and learn the level-proportion features. We
conducted evaluations on image and text domains, and our quantitative analyses
showed competitive likelihoods and the best accuracies among the baselines.
Comment: 10 pages, 7 figures, under review as a conference paper
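As a concrete illustration of generating embeddings from every component of the hierarchy rather than only the leaves, here is a minimal sketch, assuming a fixed two-level Gaussian tree and NumPy only; the tree shape, noise scales, and level-proportion vector are illustrative assumptions, not the paper's inference procedure.

```python
# A minimal sketch of hierarchy-wide generation: every embedding is a blend of
# an internal (root) component and a leaf component, weighted by level
# proportions. All dimensions and scales below are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dimensionality
n_roots, n_leaves_per_root = 2, 3       # assumed tree shape

# Internal-node means; each leaf mean is drawn around its parent.
root_means = rng.normal(0.0, 3.0, size=(n_roots, d))
leaf_means = np.stack([rng.normal(mu, 1.0, size=(n_leaves_per_root, d))
                       for mu in root_means])          # (roots, leaves, d)

def generate_embedding(level_proportions):
    """Sample one embedding as a mixture over ALL levels of one root-to-leaf
    path, weighted by level proportions (root weight, leaf weight)."""
    r = rng.integers(n_roots)
    l = rng.integers(n_leaves_per_root)
    path_means = np.stack([root_means[r], leaf_means[r, l]])   # (2, d)
    mean = level_proportions @ path_means                      # convex blend
    return rng.normal(mean, 0.5)

# An embedding dominated by the coarse level vs. one dominated by the leaf.
coarse = generate_embedding(np.array([0.9, 0.1]))
fine = generate_embedding(np.array([0.1, 0.9]))
print(coarse.round(2), fine.round(2), sep="\n")
```

Blending the root and leaf means with different level proportions is what would let such a model reconstruct data at coarse or fine abstraction levels.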
Bivariate Beta-LSTM
Long Short-Term Memory (LSTM) infers long-term dependencies through a cell
state maintained by the input and forget gate structures, which model a gate
output as a value in [0,1] through a sigmoid function. However, due to the
gradual nature of the sigmoid function, the sigmoid gate is not flexible
enough to represent multi-modality or skewness. Moreover, previous models do
not model the correlation between the gates, which could serve as an inductive
bias on the relationship between the previous and current inputs. This paper
proposes a new gate structure based on the bivariate Beta distribution. The
proposed gate structure enables probabilistic modeling of the gates within the
LSTM cell, so that modelers can customize the cell state flow with priors and
distributions. Furthermore, we theoretically show a higher upper bound on the
gradient compared to the sigmoid function, and we empirically observe that the
bivariate Beta gate structure provides higher gradient values during training.
We demonstrate the effectiveness of the bivariate Beta
gate structure on sentence classification, image classification, polyphonic
music modeling, and image caption generation.
Comment: AAAI 2020
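As a rough sketch of how correlated Beta gates could replace sigmoid gates, the snippet below (PyTorch) uses an Olkin-Liu-style construction in which two Beta variables share a Gamma component; this construction and all parameter names are our assumptions for illustration, not necessarily the paper's exact formulation.

```python
# A minimal sketch (assumed construction) of correlated, reparameterized Beta
# gates: X = Gi/(Gi+Gs) and Y = Gf/(Gf+Gs) are each Beta-distributed, and the
# shared Gamma draw Gs couples the input and forget gates.
import torch
from torch.distributions import Gamma

def bivariate_beta_gates(a_i, a_f, a_shared):
    """Return reparameterized (input_gate, forget_gate) samples in (0, 1)."""
    g_i = Gamma(a_i, torch.ones_like(a_i)).rsample()       # pathwise gradients
    g_f = Gamma(a_f, torch.ones_like(a_f)).rsample()
    g_s = Gamma(a_shared, torch.ones_like(a_shared)).rsample()
    return g_i / (g_i + g_s), g_f / (g_f + g_s)

# Shape parameters would come from the LSTM's pre-activations in practice.
a = torch.nn.functional.softplus(torch.randn(3, 16, requires_grad=True))
i_gate, f_gate = bivariate_beta_gates(a[0], a[1], a[2])
# Unlike a deterministic sigmoid output, each gate is a sample from a Beta
# distribution, which can be skewed; the shared Gamma induces correlation.
(i_gate.mean() + f_gate.mean()).backward()
```

Because all three Gamma draws are reparameterized, the gates remain trainable by backpropagation while the shared component induces the correlation between them.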
Frequency Domain-based Dataset Distillation
This paper presents FreD, a novel parameterization method for dataset
distillation, which utilizes the frequency domain to distill a small-sized
synthetic dataset from a large-sized original dataset. Unlike conventional
approaches that focus on the spatial domain, FreD employs frequency-based
transforms to optimize the frequency representations of each data instance. By
leveraging the concentration of spatial domain information on specific
frequency components, FreD intelligently selects a subset of frequency
dimensions for optimization, leading to a significant reduction in the required
budget for synthesizing an instance. Through the selection of frequency
dimensions based on the explained variance, FreD provides both theoretical
and empirical evidence of its ability to operate efficiently within a limited
budget, while better preserving the information of the original dataset
compared to conventional parameterization methods. Furthermore, based on the
orthogonal compatibility of FreD with existing methods, we confirm that FreD
consistently improves the performance of existing distillation methods across
evaluation scenarios with different benchmark datasets. We release the code at
https://github.com/sdh0818/FreD.
Comment: Accepted at NeurIPS 2023
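To illustrate the frequency-domain parameterization, the sketch below (NumPy/SciPy) ranks 2-D DCT coefficients by their variance across a dataset, treats only the top-k coefficients as the learnable budget, and decodes a synthetic instance with the inverse transform; the transform choice, the selection rule, and all shapes are assumptions for illustration rather than FreD's exact procedure.

```python
# A minimal sketch of frequency-domain parameterization: optimize only a
# variance-selected subset of DCT coefficients, zero-fill the rest, and apply
# the inverse transform to recover a spatial-domain synthetic instance.
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
data = rng.normal(size=(256, 16, 16))            # stand-in "original dataset"

# Variance of each frequency coefficient across the dataset.
coeffs = dctn(data, axes=(1, 2), norm="ortho")
variance = coeffs.var(axis=0)
k = 32                                           # assumed per-instance budget
top = np.argsort(variance.ravel())[::-1][:k]     # most informative dimensions

# Learnable parameters: only k frequency values per synthetic instance.
params = rng.normal(size=k) * 0.1

def decode(params):
    full = np.zeros(16 * 16)
    full[top] = params                           # place optimized coefficients
    return idctn(full.reshape(16, 16), norm="ortho")

synthetic_image = decode(params)                 # (16, 16) spatial instance
print(synthetic_image.shape, variance.ravel()[top[:3]])
```

Each synthetic instance then costs k parameters instead of 16*16, which is the kind of budget reduction the abstract describes.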
Implicit Kernel Attention
Attention computes the dependency between representations and encourages the
model to focus on important, selective features. Attention-based models, such
as Transformers and graph attention networks (GAT), are widely utilized for
sequential data and graph-structured data. This paper
suggests a new interpretation and generalized structure of the attention in
Transformer and GAT. For both models, we derive that the attention is a
product of two parts: 1) the RBF kernel to measure the similarity of two
instances and 2) the exponential of the $L^2$ norm to compute the importance
of individual instances. From this decomposition, we generalize the attention
in three ways. First, we propose implicit kernel attention with an implicit
kernel function, instead of manual kernel selection. Second, we generalize the
$L^2$ norm to the $L^p$ norm. Third, we extend our attention to
structured multi-head attention. Our generalized attention shows better
performance on classification, translation, and regression tasks.
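The decomposition itself is easy to verify numerically. The short check below (NumPy) confirms that the unnormalized Transformer attention weight exp(q·k/√d) factors into an RBF kernel between the two instances and exponential-of-norm terms for each instance, using appropriately rescaled queries and keys.

```python
# Numerical check of the stated decomposition: exp(q.k / sqrt(d)) equals an
# RBF similarity term times per-instance exponential-of-norm importance terms.
import numpy as np

rng = np.random.default_rng(0)
d = 16
q, k = rng.normal(size=d), rng.normal(size=d)

attn = np.exp(q @ k / np.sqrt(d))

s = d ** 0.25                                   # rescale so q.k/sqrt(d) splits
qs, ks = q / s, k / s
rbf = np.exp(-np.sum((qs - ks) ** 2) / 2)       # similarity of two instances
importance = np.exp(np.sum(qs ** 2) / 2) * np.exp(np.sum(ks ** 2) / 2)

assert np.isclose(attn, rbf * importance)       # product of the two parts
# Swapping the RBF for another (implicit) kernel, or the L2 norm for an Lp
# norm, gives the generalizations described in the abstract.
```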
Generalized Gumbel-Softmax Gradient Estimator for Various Discrete Random Variables
Estimating the gradients of stochastic nodes is one of the crucial research
questions in the deep generative modeling community, since it enables
gradient-descent optimization of neural network parameters. This estimation
problem becomes more complex when the stochastic nodes are discrete, because
pathwise derivative techniques cannot be applied. Hence, the stochastic
gradient estimation of discrete distributions requires either a score function
method or continuous relaxation of the discrete random variables. This paper
proposes a generalized version of the Gumbel-Softmax estimator with continuous
relaxation, which can relax more diverse types of discrete distributions
beyond categorical and Bernoulli. In detail, we utilize the truncation of
discrete random variables
and the Gumbel-Softmax trick with a linear transformation for the relaxed
reparameterization. The proposed approach enables the relaxed discrete random
variable to be reparameterized and backpropagated through a large-scale
stochastic computational graph. Our experiments consist of (1) synthetic data
analyses, which show the efficacy of our method, and (2) applications to VAEs
and topic models, which demonstrate the value of the proposed estimator in
practice.
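As one concrete instance of this recipe, the sketch below (PyTorch) relaxes a Poisson variable: it truncates the support to K points, applies the Gumbel-Softmax trick to the truncated categorical, and maps the relaxed one-hot back to a value with a linear transformation; the truncation level, temperature, and function names are our illustrative assumptions, not the paper's exact algorithm.

```python
# A minimal sketch of truncation + Gumbel-Softmax + linear transformation for
# a non-categorical discrete variable (here, an assumed truncated Poisson).
import torch

def relaxed_poisson(rate, K=20, temperature=0.5):
    """Reparameterized, relaxed sample approximating Poisson(rate)."""
    ks = torch.arange(K, dtype=rate.dtype)
    log_pmf = ks * torch.log(rate) - rate - torch.lgamma(ks + 1)
    log_pmf = log_pmf - torch.logsumexp(log_pmf, dim=-1)   # truncation
    gumbel = -torch.log(-torch.log(torch.rand(K)))         # Gumbel(0,1) noise
    soft_onehot = torch.softmax((log_pmf + gumbel) / temperature, dim=-1)
    return soft_onehot @ ks                    # linear map back to the support

rate = torch.tensor(4.0, requires_grad=True)
sample = relaxed_poisson(rate)
sample.backward()                # gradient flows through to the Poisson rate
print(float(sample), float(rate.grad))
```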